Abstract

Answer Set Programming (ASP) is a powerful logic-based programming language, which is enjoying
increasing interest within the scientific community and (very recently) in industry. The evaluation
of ASP programs is traditionally carried out in two steps. In the first step, an input program
$\mathcal{P}$ undergoes the so-called instantiation (or grounding) process, which produces a program
$\mathcal{P}'$ that is semantically equivalent to $\mathcal{P}$ but does not contain any variable; in
the second step, $\mathcal{P}'$ is evaluated by using a backtracking search algorithm. It is well
known that instantiation is important for the efficiency of the whole evaluation, might become a
bottleneck in common situations, is crucial in several real-world applications, and is particularly
relevant when huge input data has to be dealt with. At the time of this writing, the available
instantiator modules are not able to satisfactorily exploit the latest hardware, featuring
multi-core/multi-processor SMP (Symmetric Multiprocessing) technologies. This paper presents some
parallel instantiation techniques, including load-balancing and granularity control heuristics,
which allow for the effective exploitation of the processing power offered by modern SMP machines.
This is confirmed by an extensive experimental analysis herein reported.

To appear in Theory and Practice of Logic Programming (TPLP).
1 Introduction

Answer Set Programming (ASP) (Gelfond and Lifschitz 1991; Eiter et al. 1997; Lifschitz 1999;
Marek and Truszczynski 1999; Baral 2003; Gelfond and Leone 2002) is a purely declarative
programming paradigm based on nonmonotonic reasoning and logic programming.
The language of ASP is based on rules, allowing (in general) for both disjunction in
rule heads and nonmonotonic negation in the body. The idea of answer set programming
is to represent a given computational problem by a logic program such that its answer
sets correspond to solutions, and then to use an answer set solver to find such solutions
(Lifschitz 1999). The main advantage of ASP is its declarative nature combined with a
relatively high expressive power (Leone et al. 2006; Dantsin et al. 2001); but this comes
at the price of a high computational cost, which makes the implementation of efficient
ASP systems a difficult task. Some effort has been made to this end, and, after some
pioneering work (Bell et al. 1994; Subrahmanian et al. 1995), there are nowadays a number
of systems that support ASP and its variants (Leone et al. 2006; Janhunen et al. 2006;
Simons et al. 2002; Gebser et al. 2007; Lin and Zhao 2004; Lierler and Maratea 2004;
Anger et al. 2001).
The availability of efficient systems made ASP profitably exploitable in real-world
applications (Lembo et al. 2002; Aiello and Massacci 2001; Baral and Uyan 2001) and, recently,
also in industry (Grasso et al. 2010; Ielpa et al. 2009; Grasso et al. 2011).

Traditionally, the kernel modules of such systems operate on a ground instantiation of
the input program, i.e., a program that does not contain any variable but is semantically
equivalent to the original input. Therefore, an input program $\mathcal{P}$ first undergoes the
so-called instantiation process, which produces a ground program $\mathcal{P}'$ semantically
equivalent to $\mathcal{P}$. This phase, which is fundamental for some real-world applications
where huge amounts of input data have to be handled, is computationally expensive (see
Eiter et al. 1997; Dantsin et al. 2001); and, nowadays, it is widely recognized that having an
efficient instantiation procedure is crucial for the performance of the entire ASP system. Many
optimization techniques have been proposed for this purpose (Faber et al. 1999; Leone et al. 2001;
Leone et al. 2004); nevertheless, the performance of instantiators can be further improved
in many cases, especially when the input data are significantly large (real-world instances,
for example, may count hundreds of thousands of tuples).

In this scenario, significant performance improvements can be obtained by exploiting
modern (multi-core/multi-processor) Symmetric Multi Processing (SMP) (Stallings 1998)
machines featuring several CPUs. In the past, only expensive servers and workstations
supported this technology, whereas, at the time of this writing, most personal computer
systems and even laptops are equipped with at least one dual-core processor.
This means that the benefits of true parallel processing can be enjoyed also in entry-level
systems and PCs. However, traditional ASP instantiators were not developed with
multi-processor/multi-core hardware in mind, and are unable to fully exploit the computational
power offered by modern machines.

This paper presents¹ some advanced techniques for the parallel instantiation of ASP
programs, implemented in an instantiator system allowing the exploitation of the
computational power offered by multi-core/multi-processor machines. The system is based on
the state-of-the-art ASP instantiator of the DLV system (Leone et al. 2006); moreover, it
extends the work of (Calimeri et al. 2008) by introducing a number of relevant improvements:
(i) an additional third stage of parallelism for the instantiation of every single rule
of the program, (ii) dynamic load balancing, and (iii) granularity control strategies based
on computationally cheap heuristics. In this way, the efficacy of the system is no longer
limited to programs with many rules (as in (Calimeri et al. 2008)), and the particularly
common and difficult-to-parallelize class of programs with few rules is also handled in
an effective way. Moreover, we developed a new implementation supporting a richer input
language (e.g., aggregates) and technical improvements in thread management.

An extensive experimental activity is also reported, which was carried out on a variety
of publicly available benchmarks already exploited for evaluating the performance of
instantiation systems (Gebser et al. 2007; Denecker et al. 2009; Leone et al. 2006). A
comparison with (Calimeri et al. 2008) shows that the new techniques both combine with the
previous ones and allow for a parallel evaluation even in cases where previous techniques
were not applicable.



¹ Preliminary results have been presented in (Perri et al. 2008; Perri et al. 2010).

A scalability analysis demonstrates that the new parallel instantiator behaves very well
in all the considered instances: superlinear speedups are observed in the case of
easy-to-parallelize problem instances, and nearly optimal efficiencies are measured in the case of
hard-to-parallelize problem instances (its efficiency remains stable when the size of the
input problem grows). Importantly, the system already offers very good performance when
only two CPUs are enabled (i.e., for the majority of the commercially available hardware
at the time of this writing), and efficiency remains at a good level when more CPUs are
enabled.

The remainder of the paper is structured as follows: Section 2 outlines some basic
notions of Answer Set Programming; Section 3 describes the employed parallel instantiation
strategies; Sections 4 and 5 present heuristics, load balancing and granularity control;
Section 6 discusses the results of the experiments; finally, Section 7 is devoted to related
work, and Section 8 draws some conclusions.

2 Answer Set Programming 

In this section, we provide a formal definition of the syntax and semantics of ASP programs.

Syntax. A variable or a constant is a term. An atom is $p(t_1, \ldots, t_n)$, where $p$ is a
predicate of arity $n$ and $t_1, \ldots, t_n$ are terms. A literal is either a positive literal $p$
or a negative literal $\mathtt{not}\ p$, where $p$ is an atom. A (disjunctive) rule $r$ has the
following form:

$$a_1 \vee \ldots \vee a_n \ \texttt{:-}\ b_1, \ldots, b_k, \mathtt{not}\ b_{k+1}, \ldots, \mathtt{not}\ b_m.$$

where $a_1, \ldots, a_n, b_1, \ldots, b_m$ are atoms. The disjunction $a_1 \vee \ldots \vee a_n$ is
the head of $r$, while the conjunction $b_1, \ldots, b_k, \mathtt{not}\ b_{k+1}, \ldots,
\mathtt{not}\ b_m$ is the body of $r$.

We denote by $H(r)$ the set $\{a_1, \ldots, a_n\}$ of the head atoms, and by $B(r)$ the set
$\{b_1, \ldots, b_k, \mathtt{not}\ b_{k+1}, \ldots, \mathtt{not}\ b_m\}$ of the body literals.
$B^+(r)$ (resp., $B^-(r)$) denotes the set of atoms occurring positively (resp., negatively) in
$B(r)$.

A rule having precisely one head literal (i.e., $n = 1$) is called a normal rule. If the body
is empty (i.e., $k = m = 0$), it is called a fact, and we usually omit the :- sign. A rule
without head literals (i.e., $n = 0$) is usually referred to as an integrity constraint.² A rule
$r$ is safe if each variable appearing in $r$ also appears in some positive body literal of $r$.

An ASP program $\mathcal{P}$ is a finite set of safe rules. A $\mathtt{not}$-free (resp.,
$\vee$-free) program is called positive (resp., normal). A term, an atom, a literal, a rule, or a
program is ground if no variables appear in it. A predicate $p$ is referred to as an EDB predicate
if, for each rule $r$ having in the head an atom whose name is $p$, $r$ is a fact; all other
predicates are referred to as IDB predicates. The set of facts in which EDB predicates occur,
denoted by $EDB(\mathcal{P})$, is called the Extensional Database (EDB); the set of all other rules
is the Intensional Database (IDB).

² Note that a constraint is a shorthand for $false \ \texttt{:-}\ b_1, \ldots, b_k, \mathtt{not}\
b_{k+1}, \ldots, \mathtt{not}\ b_m.$, and it is also assumed that a rule
$bad \ \texttt{:-}\ false, \mathtt{not}\ bad$ is in the program, where $false$ and $bad$ are special
symbols appearing nowhere else in the original program.

Semantics. Let $\mathcal{P}$ be an ASP program. The Herbrand universe of $\mathcal{P}$, denoted as
$U_{\mathcal{P}}$, is the set of all constants appearing in $\mathcal{P}$. In the case when no
constant appears in $\mathcal{P}$, an arbitrary constant is added to $U_{\mathcal{P}}$. The Herbrand
base of $\mathcal{P}$, denoted as $B_{\mathcal{P}}$, is the set of all ground atoms constructible
from the predicate symbols appearing in $\mathcal{P}$ and the constants of $U_{\mathcal{P}}$. Given
a rule $r$ occurring in a program $\mathcal{P}$, a ground instance of $r$ is a rule obtained from
$r$ by replacing every variable $X$ in $r$ by $\sigma(X)$, where $\sigma$ is a substitution mapping
the variables occurring in $r$ to constants in $U_{\mathcal{P}}$. We denote by
$ground(\mathcal{P})$ the set of all the ground instances of the rules occurring in $\mathcal{P}$.
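
As an illustration of the definition of $ground(\mathcal{P})$, the following minimal Python sketch enumerates the ground instances of a single rule by applying every substitution of its variables over a given universe. The representation (atoms as tuples, variables as capitalized strings, a rule flattened into a list of atoms) and the function names are assumptions made for this example only; they are not taken from the system described in this paper.

```python
from itertools import product


def is_variable(term):
    # Assumed convention for this sketch: variables start with an uppercase letter.
    return term[:1].isupper()


def ground_instances(rule, universe):
    """Yield every ground instance of `rule` over the constants in `universe`.

    A rule is represented (hypothetically) as a list of atoms, and an atom as a
    tuple (predicate, term_1, ..., term_n). Head/body structure is ignored here,
    since substitution treats all atoms alike.
    """
    variables = sorted({t for atom in rule for t in atom[1:] if is_variable(t)})
    for values in product(sorted(universe), repeat=len(variables)):
        sigma = dict(zip(variables, values))  # one substitution
        yield [(atom[0],) + tuple(sigma.get(t, t) for t in atom[1:]) for atom in rule]


# Body atoms of a constraint such as ":- edge(X,Y), col(X,C), col(Y,C)."
constraint = [("edge", "X", "Y"), ("col", "X", "C"), ("col", "Y", "C")]
instances = list(ground_instances(constraint, {"a", "b", "red"}))
print(len(instances))  # 3 variables over 3 constants give 27 instances
```

The count grows as $|U_{\mathcal{P}}|^v$ for a rule with $v$ variables, which is why practical instantiators avoid generating the full $ground(\mathcal{P})$.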

An interpretation for $\mathcal{P}$ is a set of ground atoms, that is, an interpretation is a
subset $I$ of $B_{\mathcal{P}}$. A ground positive literal $A$ is true (resp., false) w.r.t. $I$ if
$A \in I$ (resp., $A \notin I$). A ground negative literal $\mathtt{not}\ A$ is true w.r.t. $I$ if
$A$ is false w.r.t. $I$; otherwise $\mathtt{not}\ A$ is false w.r.t. $I$. Let $r$ be a rule in
$ground(\mathcal{P})$. The head of $r$ is true w.r.t. $I$ if $H(r) \cap I \neq \emptyset$.
The body of $r$ is true w.r.t. $I$ if all body literals of $r$ are true w.r.t. $I$ (i.e.,
$B^+(r) \subseteq I$ and $B^-(r) \cap I = \emptyset$); otherwise the body of $r$ is false w.r.t.
$I$. The rule $r$ is satisfied (or true) w.r.t. $I$ if its head is true w.r.t. $I$ or its body is
false w.r.t. $I$. A model for $\mathcal{P}$ is an interpretation $M$ for $\mathcal{P}$ such that
every rule $r \in ground(\mathcal{P})$ is true w.r.t. $M$. A model $M$ for $\mathcal{P}$ is minimal
if there is no model $N$ for $\mathcal{P}$ such that $N$ is a proper subset of $M$. The set
of all minimal models for $\mathcal{P}$ is denoted by $MM(\mathcal{P})$.

In the following, the semantics of ground programs is given first; then the semantics of
general programs is given in terms of the answer sets of their instantiation.

Given a ground program $\mathcal{P}$ and an interpretation $I$, the reduct of $\mathcal{P}$ w.r.t.
$I$ is the subset $\mathcal{P}^I$ of $\mathcal{P}$ obtained by deleting from $\mathcal{P}$ the
rules in which a body literal is false w.r.t. $I$.

Note that the above definition of reduct, proposed in (Faber et al. 2004), simplifies the
original definition of the Gelfond-Lifschitz (GL) transform (Gelfond and Lifschitz 1991), but
is fully equivalent to the GL transform for the definition of answer sets (Faber et al. 2004).

Let $I$ be an interpretation for a ground program $\mathcal{P}$. $I$ is an answer set (or stable
model) for $\mathcal{P}$ if $I \in MM(\mathcal{P}^I)$ (i.e., $I$ is a minimal model for the program
$\mathcal{P}^I$) (Faber et al. 2004). The set of all answer sets for $\mathcal{P}$ is denoted by
$ANS(\mathcal{P})$.
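
The definitions of reduct and answer set can be checked by brute force on small ground programs. The sketch below is only illustrative: the encoding of a ground rule as a triple of frozensets (head, positive body, negative body) and all function names are assumptions of this example, and the exponential minimality test is not how ASP solvers work.

```python
from itertools import combinations


def body_true(rule, interp):
    _head, pos, neg = rule
    # B+(r) is a subset of I and B-(r) has empty intersection with I
    return pos <= interp and not (neg & interp)


def is_model(program, interp):
    # Every rule has a true head or a false body w.r.t. interp.
    return all(bool(rule[0] & interp) or not body_true(rule, interp) for rule in program)


def reduct(program, interp):
    # Delete the rules in which some body literal is false w.r.t. interp.
    return [rule for rule in program if body_true(rule, interp)]


def is_answer_set(program, interp):
    red = reduct(program, interp)
    if not is_model(red, interp):
        return False
    atoms = sorted(interp)
    # Minimality: no proper subset of interp may be a model of the reduct.
    for size in range(len(atoms)):
        for subset in combinations(atoms, size):
            if is_model(red, frozenset(subset)):
                return False
    return True


# Ground program:  a v b.   c :- a, not b.
program = [
    (frozenset({"a", "b"}), frozenset(), frozenset()),
    (frozenset({"c"}), frozenset({"a"}), frozenset({"b"})),
]
print(is_answer_set(program, frozenset({"b"})))       # True
print(is_answer_set(program, frozenset({"a", "c"})))  # True
print(is_answer_set(program, frozenset({"a", "b"})))  # False: {a} is a smaller model of the reduct
```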



3 Parallel Instantiation 

In this section we describe an algorithm for the instantiation of ASP programs, which
exploits parallelism in three different steps of the instantiation process. In particular, the
algorithm employs techniques presented in (Calimeri et al. 2008) and integrates them with
a novel strategy which has a larger application field, covering many situations in which the
previous techniques do not apply. In more detail, the parallel instantiation algorithm allows
for three levels of parallelism: components, rules and single rule level. The first level allows
for instantiating in parallel subprograms of the input program: it is especially useful
when handling programs containing parts that are, somehow, independent. The second one
allows for the parallel evaluation of rules within a given subprogram: it is useful when the
number of rules in the subprograms is large. The third, new one allows for the parallel
evaluation of a single rule: it is crucial for the parallelization of programs with few rules,
where the first two levels are almost never applicable. In the following, we first provide an
overview of our approach to the parallel instantiation process, giving an intuition of the
three aforementioned levels, and then we illustrate the instantiation algorithm.
3.1 Overview of the Approach

Components Level (Calimeri et al. 2008). The first level of parallelism, called Components
Level, consists in dividing the input program $\mathcal{P}$ into subprograms, according to the
dependencies among the IDB predicates of $\mathcal{P}$, and in identifying which of them can be
evaluated in parallel. In more detail, each program $\mathcal{P}$ is associated with a graph, called
the Dependency Graph of $\mathcal{P}$, which, intuitively, describes how IDB predicates of
$\mathcal{P}$ depend on each other. For a program $\mathcal{P}$, the Dependency Graph of
$\mathcal{P}$ is a directed graph $G_{\mathcal{P}} = (N, E)$, where $N$ is a set of nodes and $E$ is
a set of arcs. $N$ contains a node for each IDB predicate of $\mathcal{P}$, and $E$ contains an arc
$e = (p, q)$ if there is a rule $r$ in $\mathcal{P}$ such that $q$ occurs in the head of $r$ and
$p$ occurs in a positive literal of the body of $r$.

The graph $G_{\mathcal{P}}$ induces a subdivision of $\mathcal{P}$ into subprograms (also called
modules) allowing for a modular evaluation. We say that a rule $r \in \mathcal{P}$ defines a
predicate $p$ if $p$ appears in the head of $r$. For each strongly connected component (SCC)³ $C$
of $G_{\mathcal{P}}$, the set of rules defining all the predicates in $C$ is called the module of
$C$. Moreover, a partial ordering among the SCCs is induced by $G_{\mathcal{P}}$, defined as
follows: for any pair of SCCs $A$, $B$ of $G_{\mathcal{P}}$, we say that $B$ directly depends on $A$
if there is an arc from a predicate of $A$ to a predicate of $B$; and $B$ depends on $A$ if there
is a path in $G_{\mathcal{P}}$ from $A$ to $B$.

According to such definitions, the instantiation of the input program $\mathcal{P}$ can be
carried out by separately evaluating its modules; if the evaluation order of the modules
respects the above-mentioned partial ordering, then a small ground program is produced
(Calimeri et al. 2008). Indeed, this gives the possibility of computing ground instances of
rules containing only atoms that can possibly be derived from $\mathcal{P}$ (thus avoiding the
combinatorial explosion that can be obtained by naively considering all the atoms in the Herbrand
base).
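
The construction of the dependency graph and of an evaluation order for its components can be sketched as follows. This is a simplified illustration and not the DLV implementation: rules are abstracted (as an assumption of this example) to pairs of head predicates and positive body predicates, SCCs are computed with Kosaraju's algorithm, and components are grouped into successive "waves" of mutually independent components using the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). All function names are hypothetical.

```python
from graphlib import TopologicalSorter


def dependency_graph(rules):
    """Return arcs[p] = {q, ...}: q occurs in a head and p in the positive body of a rule."""
    idb = {q for heads, _ in rules for q in heads}
    arcs = {p: set() for p in idb}
    for heads, pos_body in rules:
        for p in pos_body:
            if p in idb:  # EDB predicates are not nodes of the graph
                arcs[p].update(heads)
    return arcs


def strongly_connected_components(arcs):
    """Kosaraju's algorithm; maps every node to its SCC (a frozenset of nodes)."""
    order, seen = [], set()

    def visit(node):  # first pass: record finishing order
        seen.add(node)
        for nxt in arcs[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in arcs:
        if node not in seen:
            visit(node)
    reverse = {n: set() for n in arcs}
    for p, qs in arcs.items():
        for q in qs:
            reverse[q].add(p)
    component = {}
    for root in reversed(order):  # second pass on the reversed graph
        if root in component:
            continue
        stack, members = [root], set()
        while stack:
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(m for m in reverse[node] if m not in component)
        for m in members:
            component[m] = frozenset(members)
    return component


def parallel_waves(arcs):
    """Group SCCs into waves; components in the same wave do not depend on each other."""
    component = strongly_connected_components(arcs)
    deps = {scc: set() for scc in component.values()}
    for p, qs in arcs.items():
        for q in qs:
            if component[p] != component[q]:
                deps[component[q]].add(component[p])  # q's SCC depends on p's SCC
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()
        waves.append({tuple(sorted(scc)) for scc in ready})
        sorter.done(*ready)
    return waves


rules = [
    ({"reach"}, {"edge"}),           # reach(X,Y) :- edge(X,Y).
    ({"reach"}, {"reach", "edge"}),  # reach(X,Y) :- reach(X,Z), edge(Z,Y).
    ({"even"}, {"odd", "next"}),     # even(X) :- odd(Y), next(Y,X).
    ({"odd"}, {"even", "next"}),     # odd(X) :- even(Y), next(Y,X).
    ({"ok"}, {"reach", "even"}),     # ok(X) :- reach(X,X), even(X).
]
waves = parallel_waves(dependency_graph(rules))
print(waves)
```

In this example the components {reach} and {even, odd} fall in the first wave and could be instantiated simultaneously, while {ok} must wait for both. Grouping into waves is a simplification chosen for brevity; a manager that starts each component as soon as all components it depends on are done can expose more parallelism.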

Intuitively, this partial ordering guarantees that a component $A$ precedes a component $B$
if the program module corresponding to $A$ has to be evaluated before the one of $B$, because
the evaluation of $A$ produces data that are needed for the instantiation of $B$. Moreover, the
partial ordering allows for determining which modules can be evaluated in parallel.
Indeed, if two components $A$ and $B$ do not depend on each other, then the instantiation of
the corresponding program modules can be performed simultaneously, because the
instantiation of $A$ does not require the data produced by the instantiation of $B$ and vice versa.
The dependency among components is thus the principle underlying the first level of
parallelism. At this level subprograms can be evaluated in parallel, but the evaluation of
each subprogram can still be further parallelized.

³ A strongly connected component of a directed graph is a maximal subset of the vertices such that
every vertex is reachable from every other vertex.

Rules Level (Calimeri et al. 2008). The second level of parallelism, called the Rules Level,
allows for concurrently evaluating the rules within each module. A rule $r$ occurring in
the module of a component $C$ (i.e., defining some predicate in $C$) is said to be recursive
if there is a predicate $p \in C$ occurring in the positive body of $r$; otherwise, $r$ is said to
be an exit rule. Rules are evaluated following a semi-naive schema (Ullman 1989), and
the parallelism is exploited for the evaluation of both exit and recursive rules. In more
detail, for the instantiation of a module $M$, first all exit rules are processed in parallel by
exploiting the data (ground atoms) computed during the instantiation of the modules which
$M$ depends on (according to the partial ordering induced by the dependency graph). Only
afterward, recursive rules are processed in parallel several times by applying a semi-naive
evaluation technique in which, at each iteration $n$, the instantiation of all the recursive
rules is performed concurrently and by exploiting only the significant information derived
during iteration $n - 1$.
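
The semi-naive schema can be made concrete on the classical reachability program. The sketch below is sequential and purely illustrative (it is not the instantiator's code, and it derives atoms rather than ground rules); the variable names `s`, `delta` and `new` are chosen in this example to mirror the usual three-set bookkeeping, in which only atoms derived at iteration $n - 1$ are joined at iteration $n$.

```python
def semi_naive_reach(edges):
    """Evaluate, with a semi-naive iteration, the program

        reach(X,Y) :- edge(X,Y).               (exit rule)
        reach(X,Y) :- reach(X,Z), edge(Z,Y).   (recursive rule)
    """
    s = set()         # atoms known before the previous iteration
    new = set(edges)  # the exit rule is evaluated once, before the loop
    while new:
        delta = new   # atoms derived in the previous iteration
        new = set()
        # The recursive rule is joined only against `delta`, never against all of `s`.
        for (x, z) in delta:
            for (z2, y) in edges:
                if z == z2 and (x, y) not in s and (x, y) not in delta:
                    new.add((x, y))
        s |= delta    # `delta` has been fully exploited
    return s


reach = semi_naive_reach({(1, 2), (2, 3), (3, 4)})
print(sorted(reach))
```

On the chain 1 → 2 → 3 → 4 the loop body runs three times, deriving paths of length 2 and then 3 before reaching the fixpoint with six atoms.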

Single Rule Level. The first two levels of parallelism are effective when handling large
programs. However, when the input program consists of few rules, their efficacy is drastically
reduced, and there are cases where components and rules parallelism are not exploitable at
all. Consider, for instance, the following program $\mathcal{P}$ encoding the well-known
3-colorability problem:

(r) col(X,red) v col(X,yellow) v col(X,green) :- node(X).
(c) :- edge(X,Y), col(X,C), col(Y,C).

The two levels of parallelism described above have no effect on the evaluation of $\mathcal{P}$.
Indeed, this encoding consists of only two rules which have to be evaluated sequentially,
since, intuitively, the instantiation of (r) produces the ground atoms with predicate col,
which are necessary for the evaluation of (c).

For the instantiation of this kind of program, a third level is necessary for the parallel
evaluation of each single rule, which is therefore called Single Rule Level.

In the following we present a strategy for parallelizing the evaluation of a rule. The
idea is to partition the extension of a single rule literal (hereafter called split literal) into a
number of subsets. Thus the rule instantiation is divided into a number of smaller similar
tasks, each of which considers as extension of the split literal only one of those subsets. For
instance, the evaluation of rule (c) in the previous example can be performed in parallel by
partitioning the extension of one of its literals, say edge, into $n$ subsets, thus obtaining
$n$ instantiation tasks for (c), working with different ground instances of edge. Note that, in
general, several body literals are possible candidates to be split up (for instance, in the case
of (c), col can be split up instead of edge) and the choice of the most suitable literal to split
has to be carefully made, since it may strongly affect the cost of the instantiation of rules.
Indeed, a "bad" split might reduce or neutralize the benefits of parallelism, thus making the
overall time consumed by the parallel evaluation not optimal (and, in some corner cases,
even worse than the time required to instantiate the original encoding). Note also that the
partitioning of the extension of the split literal has to be performed at run-time. Indeed, if
the predicate to split is an IDB predicate, as in the case of col, the partitioning can be made
only when the extension of the IDB predicate has already been computed; in our example,
only after the evaluation of rule (r).⁴

⁴ An example of workload distribution is reported in Appendix C.

// First level of parallelism
Procedure Components_Instantiator(P: Program; G_P: DependencyGraph; var Π: GroundProgram)
begin
    var S: SetOfAtoms; var C: SetOfPredicates;
    S := EDB(P); Π := ∅;
    while G_P ≠ ∅ do    // until there are components to be processed
        take a SCC C from G_P that does not depend on other SCCs of G_P
        Spawn(Rules_Instantiator, P, C, S, Π, G_P)
    end while
end;

// Second level of parallelism
Procedure Rules_Instantiator(P: Program; C: Component; var S: SetOfAtoms;
                             var Π: GroundProgram; var G_P: DependencyGraph)
begin
    var ΔS, NS: SetOfAtoms;
    ΔS := ∅; NS := ∅;
    for each r ∈ Exit(C, P) do    // evaluation of exit rules
        T_r := Spawn(SingleRule_Instantiator, r, S, ΔS, NS, Π);
    for each r ∈ Exit(C, P) do    // synchronization barrier
        join_with_thread(T_r);
    do
        ΔS := NS; NS := ∅;
        for each r ∈ Recursive(C, P) do    // evaluation of recursive rules
            T_r := Spawn(SingleRule_Instantiator, r, S, ΔS, NS, Π);
        for each r ∈ Recursive(C, P) do    // synchronization barrier
            join_with_thread(T_r);
        S := S ∪ ΔS;
    while NS ≠ ∅    // until no new information can be derived
    Remove C from G_P;    // to process C only once
end Procedure;

Fig. 1. Components and Rules parallelism



3.2 The Algorithms 

The algorithms for the three levels of parallelism mentioned above are shown in Fig. 1 and
Fig. 2. They repeatedly apply a pattern similar to the classical producer-consumers
problem. A manager thread (acting as a producer) identifies the tasks that can be performed
in parallel and delegates their instantiation to a number of worker threads (acting as
consumers). In more detail, the Components_Instantiator procedure (see Fig. 1), acting as a
manager, implements the first level of parallelism, that is, the parallel evaluation of program
modules. It receives as input both a program $\mathcal{P}$ to be instantiated and its Dependency
Graph $G_{\mathcal{P}}$; and it outputs a set of ground rules $\Pi$, such that
$ANS(\mathcal{P}) = ANS(\Pi \cup EDB(\mathcal{P}))$.
First of all, the algorithm creates a new set $S$ of atoms that will contain the subset of the
Herbrand base significant for the instantiation; in more detail, $S$ will contain, for each
predicate $p$ in the program, the extension of $p$, that is, the set of all the ground atoms having
the predicate name of $p$ (significant for the instantiation).
// Third level of parallelism
// A split is virtually identified by four iterators to S and ΔS identifying ranges of instances.
struct VirtualSplit { AtomsIterator S_begin, S_end, ΔS_begin, ΔS_end; }

Procedure SingleRule_Instantiator(r: Rule; S: SetOfAtoms; ΔS: SetOfAtoms; NS: SetOfAtoms;
                                  var Π: GroundProgram)
    var s := numberOfSplits(B(r), S, ΔS);    // according to load balancing and granularity control
    SelectLiteralToSplit(L, B(r), s);        // according to a heuristics
    var vector<VirtualSplit> Splits[s];      // create virtual splits
    SplitExtension(L, s, S, ΔS, Splits);     // distribute extension of L
    for each sp in Splits do    // spawn threads running InstantiateRule for each split
        T_sp := Spawn(InstantiateRule, r, L, sp, S, ΔS, NS, Π);
    for each sp in Splits do    // synchronization barrier
        join_with_thread(T_sp);
end Procedure;

Procedure SplitExtension(L: Literal; s: integer; S: SetOfAtoms; ΔS: SetOfAtoms;
                         var vector<VirtualSplit> Splits)
/* Given a literal L to split and a number s of splits to be produced, virtually partitions the
   extension of L stored in S and ΔS by determining a vector Splits of VirtualSplit structures
   identifying ranges of instances to be considered in each split.
*/

Procedure InstantiateRule(r: Rule; L: Literal; sp: VirtualSplit; S: SetOfAtoms; ΔS: SetOfAtoms;
                          var NS: SetOfAtoms; var Π: GroundProgram)
/* Given S and ΔS, builds all the ground instances of r, adds them to Π, and adds to NS
   the new head atoms of the generated ground rules. For L only the ground atoms belonging to
   the ranges {S_begin, S_end} and {ΔS_begin, ΔS_end} indicated by sp are used.
*/

Fig. 2. Single Rule parallelism



Initially, $S = EDB(\mathcal{P})$ and $\Pi = \emptyset$. Then, the manager checks whether some SCC
$C$ can be instantiated; in particular, it checks whether there is some other component $C'$
that has not been evaluated yet and such that $C$ depends on $C'$. As soon as a component
$C$ is processable, a new thread is created, by a call to the threading function Spawn, running
procedure Rules_Instantiator.

Procedure Rules_Instantiator (see Fig. 1), implementing the second level of parallelism,
takes as input, among the others, the component $C$ to be instantiated and the set $S$; for each
atom $a$ belonging to $C$, and for each rule $r$ defining $a$, it computes the ground instances of
$r$ containing only atoms that can possibly be derived from $\mathcal{P}$. At the same time, it
updates the set $S$ with the atoms occurring in the heads of the rules of $\Pi$. To this end, each
rule $r$ in the program module of $C$ is processed by calling procedure SingleRule_Instantiator.

It is worth noting that exit rules are instantiated by a single call to SingleRule_Instantiator,
whereas recursive ones are processed several times according to a semi-naive evaluation
technique (Ullman 1989), where at each iteration $n$ only the significant information derived
during iteration $n - 1$ is used. This is implemented by partitioning significant atoms into
three sets: $\Delta S$, $S$, and $\mathcal{N}S$. $\mathcal{N}S$ is filled with atoms computed during
the current iteration (say $n$); $\Delta S$ contains atoms computed during the previous iteration
(say $n - 1$); and $S$ contains the ones previously computed (up to iteration $n - 2$).

Initially, $\Delta S$ and $\mathcal{N}S$ are empty; the exit rules contained in the program module
of $C$ are evaluated and, in particular, one new thread identified by $T_r$ for each exit rule $r$,
running procedure SingleRule_Instantiator, is spawned. Since the evaluation of recursive
rules has to be performed only when the instantiation of all the exit rules is completed,
a synchronization barrier is exploited. This barrier is encoded (à la POSIX) by several
calls to the threading function join_with_thread, forcing Rules_Instantiator to wait until all
SingleRule_Instantiator threads are done. Then, recursive rules are processed (do-while loop).
At the beginning of each iteration, $\mathcal{N}S$ is assigned to $\Delta S$, i.e., the new
information derived during iteration $n$ is considered as significant information for iteration
$n + 1$. Then, for each recursive rule, a new thread is spawned, running procedure
SingleRule_Instantiator, which receives as input $S$ and $\Delta S$; when all threads terminate
(second barrier), $\Delta S$ is added to $S$ (since it has already been exploited). The evaluation
stops whenever no new information has been derived (i.e., $\mathcal{N}S = \emptyset$). Eventually,
$C$ is removed from $G_{\mathcal{P}}$.

The third level of parallelism (see Fig. 2), concerning the parallel evaluation of each
single rule, is then implemented by procedure SingleRule_Instantiator, which, given the
sets $S$ and $\Delta S$ of atoms that are known to be significant up to now, builds all the ground
instances of $r$ and adds them to $\Pi$. Initially, SingleRule_Instantiator selects, according to a
heuristics for load balancing (see Section 5), the number $s$ of parts in which the evaluation
has to be divided; then SingleRule_Instantiator heuristically selects a positive literal to
split in the body of $r$, say $L$ (see Section 4). A call to function SplitExtension (detailed in
Appendix A) partitions the extension of $L$ (stored in $S$ and $\Delta S$) into $s$ equally sized
parts, called splits. In order to avoid useless copies, each split is virtually identified by means
of iterators over $S$ and $\Delta S$, representing ranges of instances. In more detail, for each
split, a VirtualSplit is created containing two iterators over $S$ (resp. $\Delta S$), namely
S_begin and S_end (resp. ΔS_begin and ΔS_end), indicating the instances of $L$ from $S$ (resp.
$\Delta S$) that belong to the split. Note that, in general, a split may contain ground atoms from
both $S$ and $\Delta S$. Once the extension of the split literal has been partitioned, a number of
threads running procedure InstantiateRule are spawned. InstantiateRule, given $S$ and $\Delta S$,
builds all the ground instances of $r$ that can be obtained by considering as extension of the
split literal $L$ only the ground atoms indicated by the iterators in the virtual split at hand.
SingleRule_Instantiator terminates (last barrier) once all splits are evaluated.
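
The idea of a virtual split can be sketched with index ranges in place of iterators. The following Python fragment is only one plausible way to compute equally sized ranges; the actual SplitExtension is detailed in Appendix A and may differ. The field names mirror those of Fig. 2 in ASCII form, and the assumption that the extension of the split literal is laid out as the instances in S followed by those in ΔS is made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualSplit:
    # Half-open index ranges [begin, end) into the extension stored in S and in ΔS.
    s_begin: int
    s_end: int
    ds_begin: int
    ds_end: int


def split_extension(len_s, len_ds, s):
    """Partition len_s + len_ds instances into s contiguous, nearly equal ranges.

    No atom is copied: each split only records where its range starts and ends.
    """
    total = len_s + len_ds
    splits = []
    for i in range(s):
        lo = i * total // s
        hi = (i + 1) * total // s
        splits.append(VirtualSplit(
            s_begin=min(lo, len_s), s_end=min(hi, len_s),
            ds_begin=max(lo - len_s, 0), ds_end=max(hi - len_s, 0),
        ))
    return splits


# 5 instances in S and 3 in ΔS, divided into 2 splits of 4 instances each.
for sp in split_extension(5, 3, 2):
    print(sp)
```

In this example the second split spans the last instance of S and all three instances of ΔS, which illustrates why a split may contain ground atoms from both sets.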

The correctness of the algorithm follows from the consideration that, whatever the split 
literal L, the union of the outputs of all the s concurrent InstantiateRule procedures is the 
same as the output produced by a single call to InstantiateRule working with the entire 
extension of L (s = 1). Note that, if the split predicate is recursive, its extension may 
change at each iteration. This is considered in our approach by performing different splits 
of recursive rules at each iteration. This ensures that at each iteration the virtual splits are 
updated according to the actual extension of the literal to split. 

In addition, this choice has a relevant side-effect: at each iteration the workload is dy- 
namically re-distributed among instantiators, thus inducing a form of dynamic load bal- 
ancing in case of the evaluation of recursive rules. 
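
The correctness argument can be checked empirically on the constraint (c) of the 3-colorability example. The sketch below is a toy instantiator written for this illustration only: it represents a ground instance of (c) by the triple (x, y, c), splits the extension of edge into s parts, and compares the union of the partial results with the result of a single task. It uses the standard `concurrent.futures.ThreadPoolExecutor`; note that, in CPython, threads running pure Python code do not execute in parallel on several cores, so this demonstrates the decomposition rather than any speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def instantiate_constraint(edge_part, col):
    """Ground instances of ':- edge(X,Y), col(X,C), col(Y,C).' for a part of edge."""
    colors_of = {}
    for node, color in col:
        colors_of.setdefault(node, set()).add(color)
    return {
        (x, y, c)
        for (x, y) in edge_part
        for c in colors_of.get(x, set()) & colors_of.get(y, set())
    }


def parallel_instantiate(edge, col, s):
    """Split the extension of edge into s parts and instantiate them concurrently."""
    edge = sorted(edge)
    parts = [edge[i * len(edge) // s:(i + 1) * len(edge) // s] for i in range(s)]
    with ThreadPoolExecutor(max_workers=s) as pool:
        results = pool.map(lambda part: instantiate_constraint(part, col), parts)
        return set().union(*results)  # leaving the `with` block acts as the barrier


edge = {(1, 2), (2, 3), (1, 3)}
col = {(n, c) for n in (1, 2, 3) for c in ("red", "yellow", "green")}
whole = instantiate_constraint(edge, col)
print(parallel_instantiate(edge, col, 2) == whole)  # True
```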
4 Selection of the Literal to Split

Concerning the selection of the literal to split, the choice has to be carefully made, since
it may strongly affect the cost of the instantiation of rules. It is well known that this cost
strongly depends on the order of evaluation of body literals, since computing all the
possible instantiations of a rule is equivalent to computing all the answers of a conjunctive query
joining the extensions of literals of the rule body. However, the choice of the split literal
may influence the time spent on instantiating each split rule, whatever the join order. In the
light of these observations, we have devised a new heuristics for selecting the split literal
given an optimal ordering (which can be obtained as explained in (Leone et al. 2001)).

Intuitively, suppose we have a rule $r$ containing $n$ literals in the body in a given order,
and suppose that any body literal allows for the target number of splits, say $s$. Then, to
obtain work for $s$ threads it is sufficient to split the first literal (whatever the join order);
moving forward instead, say by splitting the third literal, would cause a replicated
evaluation of the join of the first two literals in each split thread, possibly increasing the
parallel time. The picture changes if some (or all) of the body literals do not allow for
the target number of splits: in this case one should estimate the cost of splitting a literal
different from the first and select the best possible choice.

In the following, we first introduce some metrics for estimating the work needed for 
instantiating a given rule, and then we describe the new heuristics. In detail, we use the 
following estimation for determining the size of the joins of the body literals: given two 
relations R and S, with one or more common variables, the size of R 1X1 S can be estimated 
as follows: 

T(RMS) = T(R)-T(S) 

1 ' Ylx^ariB^ariS) ™X {V (X, R),V(X, S)} 

where T(R) is the number of tuples in R, and V(X, R) (called selectivity) is the number
of distinct values assumed by the variable X in R. Given an evaluation order of body
literals, one can repeatedly apply this formula to pairs of body predicates for estimating
the size of the join of a body. A more detailed discussion on this estimation can be found
in (Ullman 1989).
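
The estimate above can be sketched in a few lines of Python. This is only an illustration of formula (1): the function name `estimate_join_size`, the dictionary representation of selectivities, and the numbers in the usage note are assumptions made for the example and are not taken from the system described here.

```python
from math import prod


def estimate_join_size(t_r, v_r, t_s, v_s):
    """Estimate T(R join S) following formula (1).

    t_r, t_s: numbers of tuples T(R) and T(S).
    v_r, v_s: dicts mapping each variable X of R (resp. S) to its
              selectivity V(X, R) (resp. V(X, S)).
    """
    # Variables occurring in both relations.
    shared = set(v_r) & set(v_s)
    # Product over the shared variables of max{V(X, R), V(X, S)};
    # the empty product is 1, which yields the Cartesian product size.
    denominator = prod(max(v_r[x], v_s[x]) for x in shared)
    return t_r * t_s / denominator
```

For instance, with a hypothetical relation R over (X, Y) having 20 tuples and 10 distinct values for Y, and a relation S over (Y, Z) having 50 tuples and 25 distinct values for Y, `estimate_join_size(20, {"X": 20, "Y": 10}, 50, {"Y": 25, "Z": 50})` evaluates 20 · 50 / 25.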

Let r be a rule with n body literals $L_1, L_2, \ldots, L_n$, where $L_i$ precedes $L_j$ for each i < j
in a given evaluation order; an estimation of the cost of instantiating the first k literals in
B(r) is:

\[
C(k) =
\begin{cases}
0 & \text{if } k < 2 \\
T(L_1) \cdot T(L_2) & \text{if } k = 2 \\
C(k-1) + T(L_1 \bowtie \cdots \bowtie L_{k-1}) \cdot T(L_k) & \text{if } k > 2
\end{cases}
\tag{2}
\]

Now, let s be the number of splits to be performed; the following is an estimation of the
work of the instantiation tasks obtained by the split of the i-th literal $L_i$:

\[
C^i = \frac{C(n) - C(i-1)}{s^i} + C(i-1), \qquad 1 \le i \le n \tag{3}
\]

where $s^i$ is equal to s (if the extension of $L_i$ is sufficiently large) or to the maximum number
of splits allowed by $L_i$. Intuitively, if $L_i$ is the split literal, the work of each instantiation
task is composed of two parts: a part to be performed serially, common to all tasks, which
consists in the instantiation of the first i − 1 literals, whose cost is represented by C(i − 1);
and the instantiation of the remaining literals, which is divided among the $s^i$ tasks, whose
cost is represented by $(C(n) - C(i-1))/s^i$. The estimation $C^i$ can be used for determining the split
literal, by choosing the one with minimum cost.

Note that, in the search for the best one, we can skip over each body literal $L_k$, with
k > j, if $L_j$ allows for s splits, since $C^j \le C^k$ holds. Indeed, if n = 2, $C^1 = C^2$; while for
n ≥ 3, k = j + 1 and $s^j = s^k = s$ (worst case) we have that

\[
C^k = \frac{C(n) - C(k-1)}{s^k} + C(k-1) = \frac{C(n) - C(j)}{s} + C(j).
\]

By applying (2), that is, C(j) = C(j − 1) + Q:

\[
C^k = \frac{C(n) - C(j-1)}{s} - \frac{Q}{s} + C(j-1) + Q = C^j + \frac{s-1}{s}\, Q
\]

where $Q = T(L_1 \bowtie \cdots \bowtie L_{j-1}) \cdot T(L_j)$. Thus, by induction, if $L_j$ allows for s splits, $C^j \le C^k$
for k > j. Intuitively, this can be explained by considering that splitting a literal
$L_k$ after one allowing for s splits $L_j$ has the effect of evaluating serially the join of literals
between $L_j$ and $L_k$, thus leading to a greater evaluation time. Clearly, even a literal L
whose extension cannot be split in s parts can be chosen, provided that L allows for a
lower (estimated) work for each instantiation task. Moreover, if $s^1 = s$ ($L_1$ allows for s
splits), it holds that $C^1 \le C^i$ for each 1 < i ≤ n; in this case, $L_1$ can be chosen without
computing any cost.

As an example, suppose that we have to instantiate the constraint :- a(X, Y), b(Y, Z),
c(Z, X), d(V, Z). Suppose also that the extensions of the body literals are T(a) = 20,
T(b) = 50, T(c) = 1000, T(d) = 1000, and that the estimations of the costs of instantiating
the first i literals with 1 ≤ i ≤ 4 are the following: C(1) = 0, C(2) = 1000,
C(3) = 7000, C(4) = 57000. Table 1 shows the estimations $C^i$ of the work of the instantiation
tasks obtained by the split of the i-th literal with 1 ≤ i ≤ 4, by varying the target
number s of splits. In particular, the first column shows the target number of splits, the
following four columns show the maximum number of splits $s^i$ allowed for each literal, and
the remaining four columns show the costs $C^i$ computed according to the $s^i$ values; for each
target number of splits, the minimal values of $C^i$ are highlighted. It can be noted
that, in our example, increasing the value of s corresponds to different choices of the literal
to split. Moreover, in each row, the choice is always restricted to the first i literals, where
the i-th literal is the first one allowing for the target number s of splits; indeed, $C^i \le C^k$
for each k > i. Furthermore, even a literal that does not allow for s splits can be chosen;
this is the case for s = 100, where the chosen literal is b(Y, Z), which allows for 50 splits.
Notice that the choice of the literal to split may be influenced by the body ordering in some
cases, which in turn considerably affects the serial evaluation time (which is the amount
to be divided by parallel evaluation). For example, all body orderings having d(V, Z) as
first literal have d as chosen literal, since its extension is sufficiently large to allow the four
target numbers of splits considered. However, if such orderings determine a higher evaluation
cost w.r.t. the body order exploited in the serial evaluation, then the effect of parallel
evaluation could be overshadowed. Thus, we apply the selection of the literal to split after
body reordering with a strategy that minimizes the heuristic cost of instantiating the body.

Table 1. Number of splits and costs of the instantiation tasks

    s    s^1   s^2   s^3   s^4       C^1       C^2       C^3       C^4
    5      5     5     5     5     11400*    11400*    12200     17000
   50     20    50    50    50      2850      1140*     2120      8000
  100     20    50   100   100      2850      1140*     1560      7500
  500     20    50   500   500      2850      1140      1112*     7100

  (* minimal value of C^i in the row)

Summarizing, our heuristic consists in determining an ordering of the body literals
by exploiting the already assessed technique described in (Leone et al. 2001), and splitting the
first literal in the body if it allows for the target number of splits (without computing any
cost). Otherwise, the estimations of the costs are determined and the literal allowing for
the minimum one is chosen.
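
The selection procedure can be sketched as follows in Python, using the values of the worked example (C(1), ..., C(4) and the extension sizes of a, b, c, d). The function names are hypothetical, and the sketch assumes that the maximum number of splits allowed by a literal is bounded by the size of its extension, i.e. $s^i = \min(s, T(L_i))$; this assumption reproduces the $s^i$ columns of Table 1 but is not stated explicitly in the text.

```python
def task_cost(c, i, s_i):
    """Equation (3): C^i = (C(n) - C(i-1)) / s^i + C(i-1).

    c is the list [C(0), C(1), ..., C(n)]; i is the 1-based index of
    the candidate split literal; s_i is the number of splits it allows.
    """
    n = len(c) - 1
    return (c[n] - c[i - 1]) / s_i + c[i - 1]


def choose_split_literal(c, tuples, s):
    """Return (index, cost) of the literal with minimum estimated task cost.

    tuples[i-1] is T(L_i). Assumption: L_i allows min(s, T(L_i)) splits.
    """
    best_i, best_cost = None, None
    for i in range(1, len(tuples) + 1):
        s_i = min(s, tuples[i - 1])
        cost = task_cost(c, i, s_i)
        if best_cost is None or cost < best_cost:
            best_i, best_cost = i, cost
        if s_i == s:
            # L_i allows the target number of splits: no later literal
            # can have a lower cost, so the search stops here.
            break
    return best_i, best_cost


# Worked example: body a(X,Y), b(Y,Z), c(Z,X), d(V,Z).
COSTS = [0, 0, 1000, 7000, 57000]   # C(0), C(1), C(2), C(3), C(4)
TUPLES = [20, 50, 1000, 1000]       # T(a), T(b), T(c), T(d)

for target in (5, 50, 100, 500):
    print(target, choose_split_literal(COSTS, TUPLES, target))
```

With these inputs the sketch selects the first literal for s = 5, the second for s = 50 and s = 100, and the third for s = 500, in agreement with the rows of Table 1. The early exit in the loop corresponds to the observation that literals following one that allows s splits need not be examined.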

5 Load Balancing and Granularity Control 

An advanced parallelization technique has to deal with two important issues that strongly
affect the performance of a real implementation: load balancing and granularity control.
Indeed, if the workload is not uniformly distributed to the available processors then the
benefits of parallelization are not fully obtained; moreover, if the amount of work assigned
to each parallel processing unit is too small then the (unavoidable) overheads due to creation
and scheduling of parallel tasks might outweigh the advantages of parallel evaluation
(in corner cases, adopting a sequential evaluation might be preferable).

As an example, consider the case in which we are running the instantiation of a rule r
on a two-processor machine and, by applying the technique for Single Rule parallelism
described above, the instantiation of r is divided into two smaller tasks, by partitioning
the extension of the split predicate of r into two subsets with approximately the same
size. Then, each of the two tasks will be processed by a thread, and the two threads will
possibly run separately on the two available processors. For limiting the inactivity time of
the processors, it would be desirable that the threads terminate their execution almost at the
same time. Unfortunately, this is not always the case, because subdividing the extension
of the split predicate into equal parts does not ensure that the workload is equally spread
among threads. However, if we consider a larger number of splits, a further subdivision of
the workload would be obtained, and the inactivity time would more likely be limited.

Clearly, it is crucial to guarantee that the parallel evaluation of a number of tasks is not
more time-consuming than their serial evaluation (granularity control), and that an unbalanced
workload distribution does not introduce significant delays and limit the overall
performance (load balancing).

Granularity Control. Our method for granularity control is based on the use of a heuristic
value W(r), which acts as a litmus paper indicating the amount of work required for evaluating
each rule of the program, and so, its "hardness", just before its instantiation. W(r)
denotes the value C(n) (see Section 4), for each rule r having n body literals.

At the rules level, rather than assigning each rule to a different thread, a set of rules R is
determined and assigned to a thread. R is such that the total work for instantiating its rules
is enough for enjoying the benefits of scheduling a new thread. In practice, R is constructed
by iterating on the rules of the same component, and stopping either when $\sum_{r \in R} W(r) > w_{seq}$
or when no further rules can be added to R, where $w_{seq}$ is an empirically-determined
threshold value. At the single rule level, a rule r is scheduled for parallel instantiation
(i.e. its evaluation can be divided into smaller tasks that can be performed in parallel) if
$W(r) > w_{seq}$; otherwise, the third level of parallelism is not applied to r at all.
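
The two granularity checks can be sketched as follows. This is a minimal Python illustration, assuming that the rules of a component are visited in a fixed order and that W(r) has already been computed for each of them; the names `group_rules` and `split_in_parallel`, as well as the numeric values in the usage note, are hypothetical.

```python
def group_rules(work, w_seq):
    """Pack the rules of one component into sets worth scheduling.

    work is a list of (rule, W(r)) pairs. Rules are accumulated into
    the current set until its total estimated work exceeds w_seq, or
    until no further rules are available.
    """
    groups, current, total = [], [], 0
    for rule, w in work:
        current.append(rule)
        total += w
        if total > w_seq:
            groups.append(current)
            current, total = [], 0
    if current:
        # Leftover rules that did not reach the threshold.
        groups.append(current)
    return groups


def split_in_parallel(w, w_seq):
    """Single-rule level: split r only if W(r) exceeds the threshold."""
    return w > w_seq
```

For example, with a threshold of 100 and estimated works 30, 50, 40, 200 and 10 for rules r1 to r5, the first three rules are grouped into one task, r4 forms a task on its own, and r5 remains as a final small set; among these rules only r4 would be split at the single rule level.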

Note that, for simplifying the presentation of the algorithms in Section 3, we have not
considered the management of the granularity control in the second level of parallelism,
which would have added noisy technical details and made the description more involved.
However, they can be suitably adapted by modifying procedure Rules_Instantiator (see
Section 3) in order to build a set of "easy" rules, and by adding a SetOfRules_Instantiator
procedure, which instantiates each rule in the set. Note also that granularity control in the
third level of parallelism is obtained by setting the number of splits of a given rule to 1.

Load Balancing. In our approach load balancing exploits different factors. On the one
hand, in the case of the evaluation of recursive rules, a dynamic load redistribution of the
extension of the split literal at each iteration is performed. On the other hand, the extension
of the split literal is divided by a number which is greater than the number of processors
(actually, a multiple of the number of processors is enough) for exploiting the preemptive
multitasking scheduler of the operating system. Moreover, in case of "very hard" rules, a
finer distribution is performed in the last splits. In particular, when a rule is assessed to be
"hard" by comparing the estimated work (the value W(r) described above) with another
empirically-determined threshold ($W(r) > w_{hard}$), a finer work distribution (exploiting a
unary split size) is performed for the last $s - n_p$ splits, where s is the number of splits and
$n_p$ is the number of processors. The intuition here is that, if a rule is hard to instantiate then
it is more likely that its splits are also hard, and thus an uneven distribution of the splits to
the available processors in the last part of the computation might cause a significant loss of
efficiency. Thus, further subdividing the last "hard" splits may help to distribute the workload
in a finer way in the last part of the computation.
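
A possible way of computing the split sizes is sketched below in Python. It reflects one reading of the description above and should not be taken as the actual implementation: the function name `split_sizes`, the `multiplier` parameter (how many splits per processor are created) and the choice of turning the tuples of the last $s - n_p$ splits into chunks of size one for "hard" rules are all assumptions made for illustration.

```python
def split_sizes(n_tuples, n_procs, multiplier=2, hard=False):
    """Sizes of the chunks into which the split literal's extension is cut.

    The extension is divided into s = multiplier * n_procs nearly equal
    chunks (never more chunks than tuples). If the rule is "hard", the
    tuples of the last s - n_procs chunks are redistributed as chunks
    of size 1 (unary split size).
    """
    s = min(n_tuples, n_procs * multiplier)
    if s == 0:
        return []
    base, extra = divmod(n_tuples, s)
    # The first `extra` chunks receive one additional tuple.
    sizes = [base + 1 if i < extra else base for i in range(s)]
    if hard and s > n_procs:
        head = sizes[:n_procs]          # regular chunks, one per processor
        tail = sum(sizes[n_procs:])     # tuples left for the finer phase
        sizes = head + [1] * tail
    return sizes
```

With 100 tuples and 4 processors, the default call produces 8 chunks of 12 or 13 tuples; with `hard=True` the first 4 chunks are kept and the remaining 48 tuples are handed out one at a time, so that processors finishing early can pick up small units of work towards the end of the computation.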

6 Experiments 

The parallelization techniques described in the previous sections were implemented in the
instantiation module of DLV (Leone et al. 2006). In order to assess the performance of the
resulting parallel instantiator we carried out an extensive experimental activity, reported
in this section. In particular, (i) we compared the previous techniques (components and
rules parallelism) with the new technique (single-rule parallelism); and (ii) we conducted
a scalability analysis on the instantiator considering the effects of increasing both the number
of available processors and the size of the instances. Before discussing the results, a
description of the implemented system and some benchmark data are given.

6.1 Implementation in DLV 

We implemented our parallel techniques by extending the instantiator module of DLV.
The system is implemented in the C++ language by exploiting the Linux POSIX Thread
API, shipped with the GCC 4.3.3 compiler. The actual implementation of the algorithms
reported in Section 3 adopts a producer-consumers pattern in which the total number of
threads spawned in each of the three levels of parallelism is limited to a fixed number
which is user-defined. This is obtained by adapting the procedures described in Section 3.1
in such a way that spawn commands are replaced by push operations in three different
shared buffers (one for each level of parallelism); moreover, thread joins ensuring the completion
of a given task (e.g., evaluation of all splits of the same rule) are replaced by proper
condition statements (e.g. counting semaphores). In this way, worker threads are recycled,
and continuously pop working tasks from the buffers, up to the end of the instantiation
process. The main motivation for this technical variant is limiting thread creation overhead
to a single initialization step. In addition, the implementation allows for separately
enabling/disabling the three levels of parallelism by command line options.
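
The producer-consumers scheme can be illustrated with a short Python sketch. It is not the DLV code, which is written in C++ on top of POSIX threads and uses three buffers; here a single shared buffer is used, the function name `run_pool` is hypothetical, and the completion check relies on the task counter built into `queue.Queue` rather than on explicit counting semaphores. Because of Python's global interpreter lock, the sketch shows only the control structure and not an actual speedup.

```python
import queue
import threading


def run_pool(tasks, n_workers):
    """Run callables taken from a shared buffer on a fixed set of threads."""
    buffer = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        # Workers are created once and recycled: they keep popping
        # tasks until they receive the None sentinel.
        while True:
            task = buffer.get()
            if task is None:
                buffer.task_done()
                return
            outcome = task()
            with lock:
                results.append(outcome)
            buffer.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        buffer.put(task)        # a "push" replaces spawning a new thread
    buffer.join()               # replaces joining the threads of a task
    for _ in threads:
        buffer.put(None)        # tell every worker to stop
    for t in threads:
        t.join()
    return results
```

A call such as `run_pool([lambda i=i: i * i for i in range(10)], 3)` evaluates ten small tasks on three recycled workers; the order of the collected results depends on scheduling.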

6.2 Benchmark Problems and Data 

In our experiments, several well-known combinatorial problems as well as real-world problems
are considered. These benchmarks have been already used for assessing ASP instantiator
performance (Leone et al. 2006; Gebser et al. 2007; Denecker et al. 2009). Many of
them are particularly difficult to parallelize due to the compactness of their encodings; note
that concise encodings are quite common given the declarative nature of the ASP language,
which allows one to compactly encode even very hard problems. As for the data, we considered
five instances (where the instantiation time is non-negligible) of increasing difficulty for
each problem, except for the Hamiltonian Path and 3-Colorability problems, for which generators
are available, and we could generate several instances of increasing size.

In order to meet space constraints, encodings are not presented but they are available,
together with the employed instances and the binaries, at
http://www.mat.unical.it/ricca/downloads/parallelground10.zip. Rather, to help the understanding
of the results, both a description of the problems and some information on the number of rules
of each program are reported below.

n-Queens. The 8-queens puzzle is the problem of putting eight chess queens on an 8×8
chessboard so that none of them is able to capture any other using the standard chess
queen's moves. The n-queens puzzle is the more general problem of placing n queens
on an n×n chessboard (n ≥ 4). The encoding consists of one rule and four constraints.
Instances were considered having n ∈ {37, 39, 41, 43, 45}.

Ramsey Numbers. The Ramsey number ramsey(k, m) is the least integer n such that, no
matter how the edges of the complete undirected graph (clique) with n nodes are colored
using two colors, say red and blue, there is a red clique with k nodes (a red k-clique) or
a blue clique with m nodes (a blue m-clique). The encoding of this problem consists of
one rule and two constraints. For the experiments, we considered the problem of deciding
whether, for k = 7, m = 7, and n ∈ {31, 32, 33, 34, 35}, n is ramsey(k, m).

Clique. A clique in an undirected graph G = (V, E) is a subset of its vertices such that
every two vertices in the subset are connected by an edge. We considered the problem
of finding a clique in a given input graph. Five graphs of increasing size were considered.

Timetabling. The problem of determining a timetable for some university lectures which
have to be given in a week to some groups of students. The timetable must respect a number
of given constraints concerning availability of rooms, teachers, and other issues related to
the overall organization of the lectures. Many instances were provided by the University of
Calabria; they refer to different numbers of student groups g ∈ {15, 17, 19, 21, 23}.

Sudoku. Given an N×N grid board, where N is a square number N = M², fill it with
integers from 1 to N so that each row, each column, and each of the N M×M boxes
contains each of the integers from 1 to N exactly once. Suppose the rows are numbered
1 to N from left to right, and the columns are numbered 1 to N from top to bottom. The
boxes are formed by dividing the rows from top to bottom every M rows and dividing
the columns from left to right every M columns. Encoding and instances were used for
testing the competing solvers in the ASP Competition 2009 (Denecker et al. 2009). For
assessing our system we considered the instances {sudoku.ins, sudoku.inQ, sudoku.ini,
sudoku.in9, sudoku.in10}, where N = 25.

Golomb Ruler. A Golomb ruler is an assignment of marks to integer positions along a
ruler so that no two pairs of marks are the same distance apart. The number of
marks is the order of the ruler. The first mark is required to be at position 0; the position
of the highest mark is the length of the ruler. The problem is finding the shortest ruler of
a given order. Encoding and instances have been used for testing the competing solvers
in the ASP Competition 2009 (Denecker et al. 2009). Instances are described by a pair
(m, p) where m is the number of marks and p is the number of positions: we considered the
values (10, 125), (13, 150), (14, 175), (15, 200) and (15, 225).

Reachability. Given a directed graph G = (V, E), we want to compute all pairs of nodes
(a, b) ∈ V × V such that b is reachable from a through a nonempty sequence of edges
in E. The encoding of this problem consists of one exit rule and a recursive one. Five
trees were generated, each described by a pair (number of levels, number of siblings): (9,3),
(7,5), (14,2), (10,3) and (15,2), respectively.

Food. The problem here is to generate plans for repairing faulty workflows. That is, starting
from a faulty workflow instance, the goal is to provide a completion of the workflow
such that the output of the workflow is correct. Workflows may comprise many activities.
Repair actions are compensation, (re)do and replacement of activities. A single instance
was provided related to a workflow containing 63 predicates, 56 components and 116 rules.

3-Colorability. This well-known problem asks for an assignment of three colors to the
nodes of a graph, in such a way that adjacent nodes always have different colors. The
encoding of this problem consists of one rule and one constraint. A number of simplex
graphs were generated with the Stanford GraphBase library (Knuth 1994), by using the
function simplex(n, n, −2, 0, 0, 0, 0), where 80 < n < 250.

Hamiltonian Path. A classical NP-complete problem in graph theory, which can be expressed
as follows: given a directed graph G = (V, E) and a node a ∈ V of this graph,
does there exist a path in G starting at a and passing through each node in V exactly once?
The encoding of this problem consists of several rules, one of which is recursive. Instances
were generated by using a tool by Patrik Simons (cf. (Simons 2000)), with n nodes with
1000 < n < 12000.

The machine used for the experiments is a two-processor Intel Xeon "Woodcrest" (quad
core) 3GHz machine with 4MB of L2 Cache and 4GB of RAM, running Debian GNU
Linux 4.0. Since our techniques focus on instantiation, all the results of the experimental
analysis refer only to the instantiation process rather than the whole process of computing
answer sets; in addition, the time spent before the grounding stage (parsing) is obviously
the same for both the parallel and non-parallel versions. In order to obtain more trustworthy
results, each single experiment was repeated five times and the averages of the performance
measures are reported.

  Problem        serial             levels 1+2         level 3            levels 1+2+3

  queens_1       4.64 (0.00)        2.19 (0.06)        0.71 (0.01)        0.69 (0.01)
  queens_2       5.65 (0.00)        3.29 (0.51)        0.89 (0.01)        0.86 (0.02)
  queens_3       6.83 (0.00)        3.85 (0.50)        1.08 (0.00)        1.03 (0.02)
  queens_4       8.19 (0.00)        4.50 (0.48)        1.25 (0.01)        1.22 (0.00)
  queens_5       9.96 (0.00)        4.72 (0.11)        1.45 (0.02)        1.43 (0.01)

  ramsey_1       258.52 (0.00)      131.61 (0.23)      40.53 (0.21)       36.23 (0.34)
  ramsey_2       328.68 (0.00)      168.03 (1.38)      51.35 (0.25)       46.09 (0.33)
  ramsey_3       414.88 (0.00)      211.60 (0.55)      68.49 (0.10)       58.06 (0.20)
  ramsey_4       518.28 (0.00)      265.58 (2.82)      83.32 (0.24)       75.19 (2.41)
  ramsey_5       643.65 (0.00)      327.25 (0.79)      103.14 (0.44)      92.28 (0.65)

  clique_1       16.06 (0.00)       15.88 (0.20)       3.34 (0.04)        2.35 (0.01)
  clique_2       29.98 (0.00)       29.97 (0.01)       4.41 (0.12)        4.34 (0.07)
  clique_3       49.11 (0.00)       49.18 (0.05)       7.11 (0.03)        7.09 (0.02)
  clique_4       78.05 (0.00)       78.70 (0.35)       11.35 (0.14)       11.29 (0.11)
  clique_5       119.48 (0.00)      118.66 (0.07)      17.08 (0.14)       17.09 (0.16)

  timetab_1      11.48 (0.00)       7.28 (0.04)        2.35 (0.04)        2.29 (0.01)
  timetab_2      17.49 (0.00)       8.30 (0.07)        2.61 (0.01)        2.61 (0.02)
  timetab_3      21.65 (0.00)       10.22 (0.05)       3.22 (0.04)        3.20 (0.01)
  timetab_4      17.75 (0.00)       8.24 (0.04)        2.61 (0.01)        2.64 (0.05)
  timetab_5      23.69 (0.00)       11.09 (0.01)       3.52 (0.01)        3.50 (0.03)

  sudoku_1       5.42 (0.00)        4.15 (0.04)        0.98 (0.01)        0.88 (0.01)
  sudoku_2       9.87 (0.00)        7.75 (0.01)        1.59 (0.02)        1.51 (0.02)
  sudoku_3       10.28 (0.00)       8.01 (0.05)        1.62 (0.00)        1.57 (0.01)
  sudoku_4       10.56 (0.00)       8.38 (0.01)        1.75 (0.02)        1.63 (0.03)
  sudoku_5       11.08 (0.00)       8.25 (0.04)        1.67 (0.02)        1.63 (0.05)

  gol_ruler_1    6.58 (0.00)        6.32 (0.07)        0.96 (0.01)        0.94 (0.02)
  gol_ruler_2    13.74 (0.00)       12.57 (0.04)       1.87 (0.04)        1.84 (0.09)
  gol_ruler_3    24.13 (0.00)       22.67 (0.05)       3.29 (0.06)        3.25 (0.13)
  gol_ruler_4    40.64 (0.00)       37.50 (0.10)       5.44 (0.21)        5.51 (0.10)
  gol_ruler_5    62.23 (0.00)       59.04 (0.04)       8.32 (0.12)        8.36 (0.17)

  reach_1        52.21 (0.00)       52.07 (0.07)       8.25 (0.06)        8.28 (0.01)
  reach_2        147.34 (0.00)      148.34 (0.01)      22.60 (0.16)       22.67 (0.18)
  reach_3        258.01 (0.00)      240.17 (0.13)      39.59 (0.29)       39.57 (0.44)
  reach_4        522.09 (0.00)      517.97 (0.59)      77.21 (0.20)       77.52 (0.31)
  reach_5        1072.00 (0.00)     1069.86 (1.04)     160.66 (0.21)      160.31 (0.25)

  Food           684.95 (1.19)      0.18 (0.15)        104.6 (1.01)       0.08 (0.01)

  (a) Average instantiation times in seconds (standard deviation)

  (b) Instantiation times (s) - Hamiltonian Path [plot; x-axis: Number of nodes]

  (c) Instantiation times (s) - 3-Colorability [plot]

Fig. 3. Impact of parallelization techniques



6.3 Effect of Single Rule Parallelism 

In this section we report the results of an experimental analysis aimed at comparing the
effects of the single rule parallelism with the first two levels. More in detail, we considered
four versions of the instantiator: (i) serial, where parallel techniques are not applied,
(ii) levels 1+2, where components and rules parallelism are applied, (iii) level 3, where
only the single rule level is applied, and (iv) levels 1+2+3, in which all the three levels
are applied. Results are shown in Fig. 3(a)(b)(c), where the two graphs report the average
instantiation times for the Hamiltonian Path problem and 3-Colorability, while the table
reports the average instantiation times in seconds for the remaining benchmarks (see footnote 5).

More in detail, in Fig. 3(a) the first column reports the problem considered, whereas
the next columns report the results for the four instantiators. Looking at the third column
in the table, benchmarks can be classified in three different groups according to their
behaviors: the benchmarks in which the first two levels of parallelism apply, those where
these first two levels apply marginally, and those where they do not apply at all. In the
first group, we have the n-Queens problem, Ramsey Numbers, and Timetabling, where
levels 1+2 is about twice as fast as serial; however, considering that the machine on
which we ran the benchmarks has eight cores available, levels 1+2 is not able to exploit all
the computational power at hand. The reason is that the encodings of these benchmarks
either have a small number of rules (n-Queens, Ramsey Numbers), or they show an intrinsic
dependency among components/rules (Timetabling) that limits the efficacy of the first
two levels of parallelism. These considerations explain also the behavior of the other two
groups of benchmarks. More in detail, for the second group (which contains only Sudoku)
a small improvement is obtained due to few rules which are evaluated in parallel, while the
benchmarks belonging to the third group, whose encodings have very few interdependent
rules (e.g. Reachability), proved hard to parallelize. Looking at the graphs, Hamiltonian
Path and 3-Colorability clearly belong to the third group; indeed, the lines of serial and
levels 1+2 overlap.

A special case is the Food problem, which proved very easy to parallelize and shows an
impressive performance. This behavior can be explained by a different scheduling
of the constraints performed by the serial version and the levels 1+2 one. In particular,
this instance is inconsistent (there is a constraint always violated) and both versions stop
the computation as soon as they recognize this fact. The scheduling performed by the parallel
version allows the identification of this situation before the serial one, since constraints
are evaluated in parallel, while the latter evaluates the inconsistent constraint later on.

Concerning the behavior of level 3, we notice that it always performs very well (always
more than 7.5x faster than serial), and in all cases but Food, it outperforms levels 1+2.
Basically, the third level of parallelism applies to every single rule, and thus it is effective
on all problems, even those with very small encodings. In the case of Food, even if level 3
is about 7x faster than serial, it evaluates rules in the same order as serial, thus
recognizing the inconsistent constraint later than levels 1+2.

The good news is that the three levels of parallelism always combine (even in the case
of Food). This can be easily seen by looking at the last column of the table and at the two
graphs. Note that most of the advantages are due to the third level of parallelism. Indeed,
in the graphs, the lines for level 3 and levels 1+2+3 overlap, and levels 1+2+3 shows only
marginal gains w.r.t. level 3 in the benchmarks where levels 1+2 applies.



5 Results in the form of tables for Hamiltonian Path and 3-Colorability are reported in Appendix B.

Fig. 4. Hamiltonian Path and 3-Colorability: average instantiation times (s) and efficiency



6.4 Scalability of the Approach 

We conducted a scalability analysis on the instantiator levels 1+2+3, which exploits all the
three parallelism levels. Moreover, we considered the effects of increasing both the size
of the instances and the number of available processors (from 1 up to 8 CPUs; see footnote 6). The
results of the analysis are summarized in Table 2 and Figure 4, where both the average
instantiation times and the efficiencies are reported. As before, the graphs show results
for Hamiltonian Path and 3-Colorability, while the results of the remaining problems are
reported in the table. The overall picture is very positive: the performance of the instantiator
is very good in all cases and average efficiencies vary between 0.85 and 0.95 when all the
available CPUs are enabled. As one would expect, the efficiency of the system slightly
decreases when the number of processors increases, still remaining at a good level, and
rapidly increases going from very small instances (<2 seconds) to larger ones.
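
For reference, the sketch below shows how an efficiency value can be derived from two timings. It assumes the usual definition (speedup divided by the number of processors, with speedup taken as the ratio between the reference time and the parallel time); the function names are hypothetical, and whether the reference time is that of the serial instantiator or of the parallel one restricted to a single CPU is left open here.

```python
def speedup(t_reference, t_parallel):
    """Ratio between the reference time and the parallel time."""
    return t_reference / t_parallel


def efficiency(t_reference, t_parallel, n_procs):
    """Speedup normalized by the number of processors used."""
    return speedup(t_reference, t_parallel) / n_procs
```

As an illustration, taking the serial and levels 1+2+3 times reported for the largest Ramsey Numbers instance in Fig. 3(a) (643.65s and 92.28s) and 8 processors, `efficiency(643.65, 92.28, 8)` gives a value between 0.85 and 0.9.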

The granularity control mechanism proved effective in the n-Queens problem,
where all the considered instances required less than 10 seconds of serial execution time.
Indeed, the "very easy" disjunctive rule was always sequentially evaluated in all the cases.
Since the remaining constraints are strictly determined by the result of the evaluation of the
disjunctive rule, the unavoidable presence of a sequential part limited the final efficiency to
a remarkable 0.9 in the case of 8 processors. A similar scenario can be observed in the case
of Ramsey Numbers, where the positive impact of the load balancing and granularity control
heuristics becomes very evident. In fact, since the encoding is composed of few "very
easy" disjunctive rules and two "very hard" constraints, the heuristics select a sequential
evaluation for the rules, and dynamically apply the finer distribution of the last splits for
the constraints. As a result, the system produces a well-balanced work subdivision that
allows for obtaining steady results with an average efficiency greater than or almost equal to 0.9
in all tested configurations. The same holds for Clique, which has a short encoding consisting
of only three easy rules, for which granularity control schedules a serial execution, and one
"hard" constraint which can be split and thus evaluated in parallel.

6 Available processors can be disabled (respectively enabled) by running the bash Linux commands:
echo 0 >> /sys/devices/system/cpu/cpu-n/online (resp. echo 1 >> /sys/devices/system/cpu/cpu-n/online).

A good performance is also obtained in the case of Reachability. This problem is made
up of only two rules; the first one is caught by granularity control, which schedules its
serial execution. The second one is a heavy recursive rule that requires several iterations
to be grounded. In this case a good load balancing is obtained thanks to the redistributions
applied (with possibly different split sizes) at each iteration of the semi-naive algorithm.

The instantiator is effective also in Golomb Ruler, Timetabling and Sudoku, where the
performance is good also thanks to a well-balanced workload distribution.

About Food, a super-linear speedup (due to the first levels of parallelism) is already evident
with two processors, and efficiency peaks when three processors are enabled, where
the execution times become negligible. The behavior of the system for instances of varying
sizes was analyzed in more detail in the case of Hamiltonian Path and 3-Colorability;
this was made possible by the availability of generators.

Looking at Figures 4(a) and 4(c), it is evident that the efficiency of the system rapidly
reaches a good level (ranging from 0.9 up to 1), moving from small instances (requiring
less than 2s) to larger ones, and remains stable (the surfaces are basically plateaux). The
corresponding gains are visible by looking at Figures 4(b) and 4(d) where, e.g., a Hamiltonian
Path (3-Colorability) instance is evaluated in 332.78s (965.36s) by the serial system,
and requires only 68.26s (124.70s) with levels 1+2+3 with 8 processors enabled.

Summarizing, the parallel instantiator behaved very well in all the considered instances.
It showed superlinear speedups in the case of easy-to-parallelize instances and, in the other
cases, its efficiency rapidly reaches good levels and remains stable when the sizes of the
input problem grow. Importantly, the system offers a very good performance already when
only two CPUs are enabled (i.e. for the large majority of the commercially-available
hardware at the time of this writing), and efficiency remains at a very good level when up
to 8 CPUs are available.



7 Related Work 

Several works about parallel techniques for the evaluation of ASP programs have been 
proposed, focusing on both the propositional (model search) phase dFinkel et al. 20011 
lEllguth et al. 2009| IGressmann et al. 20051 IPontelli and El-Khatib 20011 and the instan- 
tiation phase (Baldu ccini et al. 20051 ICa limeri et al. 2008). Model generation is a distinct 
phase of ASP computation, carried out after the instantiation, and thus, the first group of 
proposals is not directly related to our setting. Concerning the parallelization of the instan- 
tiation phase, some preliminary studies were carried out in (Bal duccini et al. 20 05). as one 
of the aspects of the attempt to introduce parallelism in non-monotonic reasoning systems. 
However, there are crucial differences with our system regarding both the employed
technology and the supported parallelization strategy. Indeed, our system is implemented by
using POSIX threads APIs, and works in a shared memory architecture (Stallings 1998),
while the one described in (Balduccini et al. 2005) actually runs on a Beowulf cluster working
in local memory. Moreover, the parallel instantiation strategy of (Balduccini et al. 2005) is
applicable only to a subset of the program rules (those not defining domain predicates), 
and is, in general, unable to exploit parallelism fruitfully in the case of programs with a 
small number of rules. Importantly, the parallelization strategy of (Balduccini et al. 2005)
statically assigns a rule per processing unit; whereas, in our approach, both the extension 
of predicates and split sizes are dynamically computed (and updated at different itera- 
tions of the semi-naive evaluation) while the instantiation process is running. Note also 
that our parallelization techniques could be adapted for improving other ASP instantiators
like Lparse (Niemelä and Simons 1997) and Gringo (Gebser et al. 2007). Concerning
other related works, it is worth remembering that the Single Rule parallelism employed
in our system is related to the copy and constrain technique for parallelizing the
evaluation of deductive databases (Wolfson and Silberschatz 1988; Wolfson and Ozeri 1990;
Ganguly et al. 1990; Zhang et al. 1995; Dewan et al. 1994). In many of the mentioned works
(dating back to the 90's), only restricted classes of Datalog programs are parallelized; whereas,
the most general ones (reported in (Wolfson and Ozeri 1990; Zhang et al. 1995)) are
applicable to normal Datalog programs. Clearly, none of them considers the peculiarities of
disjunctive programs and unstratified negation. More in detail, (Wolfson and Ozeri 1990)
provides the theoretical foundations for the copy and constrain technique, whereas (Zhang et al. 1995)
enhances it in such a way that the network communication overhead in distributed systems 
can be minimized. The copy and constrain technique works as follows: rules are replicated 
with additional constraints attached to each copy; such constraints are generated by exploit- 
ing a hash function and allow for selecting a subset of the tuples. The obtained restricted 
rules are evaluated in parallel. The technique employed in our system shares the idea of 
splitting the instantiation of each rule, but has several differences that allow for obtaining an 
effective implementation. Indeed, in (Wolfson and Ozeri 1990; Zhang et al. 1995) copied
rules are generated and statically associated to instantiators according to a hash function
which is independent of the current instance in input. In contrast, in our technique, the
distribution of predicate extensions is performed dynamically, before assigning the rules to
instantiators, by taking into account the "actual" predicate extensions. In this way, the
nontrivial problem (Zhang et al. 1995) of choosing a hash function that properly distributes
the load is completely avoided in our approach. Moreover, the evaluation of the conditions
attached to the rule bodies during the instantiation phase would require either modifying the
standard instantiation procedure (for efficiently selecting the tuples from the predicate
extensions according to the added constraints) or incurring a possibly non-negligible overhead due
to their evaluation. Focusing on the heuristics employed in parallel databases, we
mention (Dewan et al. 1994) and (Carey and Lu 1986). In (Dewan et al. 1994) a heuristic is
described for balancing the distribution of load in the parallel evaluation of PARULEL, a
language similar to Datalog. Here, load balancing is done by a manager server that records
the execution times at each site, and exploits this information for distributing the load
according to predictive dynamic load balancing (PDLB) protocols that "update and
reorganize the distribution of workload at runtime by modifying the restrictions on versions
of the rule program" (Dewan et al. 1994). In (Carey and Lu 1986) the proposed heuristics
were devised for both minimizing communication costs and choosing an opportune site for
processing sub-queries among various network-connected database systems. In both cases,
the proposed heuristics were devised and tuned for dealing with data distributed in several
sites and their application to other architectures might be neither viable nor straightforward.
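
The difference between a static, hash-based distribution and a distribution computed on the actual extension can be illustrated with a small Python sketch. It is not taken from any of the cited systems: the hash function (`zlib.crc32` applied to the textual form of a tuple), the function names and the toy edge relation are all assumptions made here for illustration.

```python
import zlib


def copy_and_constrain(tuples, n):
    """Static partitioning: tuple t is handled by copy h(t) mod n.

    The assignment depends only on the hash function, not on how many
    tuples the current input instance actually contains.
    """
    parts = [[] for _ in range(n)]
    for t in tuples:
        h = zlib.crc32(repr(t).encode("utf-8"))
        parts[h % n].append(t)
    return parts


def dynamic_split(tuples, n):
    """Dynamic partitioning: cut the actual extension into at most n
    contiguous ranges whose sizes differ by at most one chunk remainder."""
    if not tuples:
        return []
    size = -(-len(tuples) // n)  # ceiling division
    return [tuples[i:i + size] for i in range(0, len(tuples), size)]


edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
static_parts = copy_and_constrain(edges, 2)
dynamic_parts = dynamic_split(edges, 2)
print([len(p) for p in static_parts], [len(p) for p in dynamic_parts])
```

Both functions cover every tuple exactly once; only the second one guarantees balanced parts regardless of the data, which is the property the dynamic approach relies on.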

8 Conclusions 

In this paper we present some advanced techniques for the parallel instantiation of ASP 
programs, and a parallel ASP instantiator based on the DLV system. In particular, we have 
proposed and implemented a three-level parallelization technique, dynamic load balanc- 
ing, and granularity control strategies. An experimental analysis outlines significant per- 
formance improvements, larger applicability w.r.t. existing approaches, as well as nearly 
optimal efficiency and steady scalability of the implemented instantiator. 

As far as future work is concerned, we are studying other techniques for further ex- 
ploiting parallelism in ASP systems, considering also the other phases of the computation. 
Automatic determination of heuristics thresholds is also under investigation. 

Appendix A Splitting the Extension of a Literal

Figure A1 details an implementation of procedure SplitExtension, which plays a
central role in the single-rule parallelization algorithms presented in Section 3.2. In particular,
SplitExtension partitions the extension of l (stored in S and ΔS) into n splits. In
order to avoid useless copies, each split is virtually identified by means of iterators over S
and ΔS, representing ranges of instances.

Procedure SplitExtension (l: Literal; n: integer; S: SetOfAtoms; ΔS: SetOfAtoms;
                          var vector<VirtualSplit> V)
begin
  integer size := ⌊(S.size() + ΔS.size()) / n⌋;
  integer i := 0; AtomsIterator it := S.begin();
  while i < ⌊S.size() / size⌋ do // possibly, build splits with atoms from S
    V[i].SetIterators_S(it, it+size); it := it + size; i := i+1;
  end while
  if it < S.end() then // possibly, build a split mixing S and ΔS atoms
    V[i].SetIterators_S(it, S.end()); it := ΔS.begin();
    integer k := size - Size(V[i]);
    if ΔS.size() < k then
      V[i].SetIterators_ΔS(it, ΔS.end()); it := ΔS.end();
    else
      V[i].SetIterators_ΔS(it, it+k); it := it+k; i := i+1;
  while i < ⌊(S.size() + ΔS.size()) / size⌋ do // possibly, build splits with atoms from ΔS
    V[i].SetIterators_ΔS(it, it+size); it := it + size; i := i+1;
  end while
  if it < ΔS.end() then
    V[i].SetIterators_ΔS(it, ΔS.end());
end Procedure

Fig. A1. Splitting the extension of a literal.

More in detail, for each split, an instance of VirtualSplit is created containing two iterators
over S (resp. ΔS), namely S_begin and S_end (resp. ΔS_begin and ΔS_end), indicating the
instances of l from S (resp. ΔS) that belong to the split. The procedure starts by building
splits with atoms from S; then, it proceeds by considering atoms from ΔS. Note that, in
general, a split may mix ground atoms from both S and ΔS.
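
The idea of virtual splits can be sketched in Python as follows. This is a simplified reformulation rather than a transcription of the procedure in the figure: the two sets are treated as one concatenated sequence, index ranges stand in for iterators, and the split size is rounded up so that at most n splits are produced. The names `VirtualSplit`, `s_range`, `delta_range` and `split_extension` are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Half-open index range [begin, end); None plays the role of a null iterator pair.
Range = Optional[Tuple[int, int]]


@dataclass
class VirtualSplit:
    s_range: Range = None      # portion of the first set covered by the split
    delta_range: Range = None  # portion of the second set covered by the split

    def size(self) -> int:
        return sum(end - begin for begin, end in (self.s_range, self.delta_range) if True) \
            if self.s_range and self.delta_range else \
            sum(r[1] - r[0] for r in (self.s_range, self.delta_range) if r is not None)


def split_extension(s_size: int, delta_size: int, n: int) -> List[VirtualSplit]:
    """Partition s_size + delta_size atoms into at most n virtual splits, copying nothing."""
    total = s_size + delta_size
    if total == 0 or n <= 0:
        return []
    size = -(-total // n)  # ceiling division (a choice of this sketch)
    splits = []
    for start in range(0, total, size):
        end = min(start + size, total)
        # Part of the chunk that falls inside the first set, if any.
        s_rng = (start, min(end, s_size)) if start < s_size else None
        # Part of the chunk that falls inside the second set, if any.
        d_rng = (max(start, s_size) - s_size, end - s_size) if end > s_size else None
        splits.append(VirtualSplit(s_rng, d_rng))
    return splits


# Four atoms in the first set, none in the second, two splits requested.
print(split_extension(4, 0, 2))
# Three atoms in each set, four splits requested: the middle split mixes both sets.
print(split_extension(3, 3, 4))
```

Since only ranges are stored, building the splits costs time proportional to the number of splits, not to the number of atoms.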

Appendix B Detailed Results: 3Colorability and Hamiltonian Path

Tables B1 and B2 contain detailed results for the benchmarks 3Colorability and
Hamiltonian Path. We recall that, in the case of 3Colorability, 18 simplex graphs were generated by
means of the Stanford GraphBase library using the function simplex(n, n, -2, 0, 0, 0, 0),
where 80 < n < 250; whereas, for the Hamiltonian Path benchmark, 14 graphs were
generated by using a tool by Patrik Simons, with n nodes, where 1000 < n < 12000.

In detail, Table B1 reports the results of an experimental analysis aimed at comparing
the effects of the single rule parallelism with the first two levels. The first column reports
the problem considered, whereas the next columns report the results for four versions of the
instantiator: (i) serial, where parallel techniques are not applied; (ii) levels1+2, where
components and rules parallelism are applied; (iii) level3, where only the single rule level
is applied; and (iv) levels1+2+3, in which all the three levels are applied.

Table B2 reports the results of a scalability analysis on the instantiator levels1+2+3,
which exploits all the three parallelism levels. In particular, both the average instantiation
times and the efficiencies are reported by considering the effects of increasing both the size
of the instances and the number of available processors (from 1 up to 8 CPUs).

Problem   serial          levels1+2       level3          levels1+2+3
3-Col     8.61 (0.00)     8.55 (0.09)     1.27 (0.03)     1.26 (0.03)
3-Col     14.16 (0.00)    13.70 (0.04)    1.90 (0.01)     1.90 (0.01)
2.79 (0.03)
3-Col
57.08 (0.24)
78.64 (0.00)
59.86 (2.46)
71.35 (0.44)
205.60 (0.39)
29.23 (0.07)
47.68 (0.03)
HP11
57.66 (1.05)
492.28 (1.68)
68.95 (0.56)

Table B1. 3Colorability and Hamiltonian Path results: average instantiation times in seconds
(standard deviation).

Average instantiation times in seconds (standard deviation), by number of CPUs

Problem   1               2               3               4               5               6               7               8
3-Col?    281.29 (0.00)   143.00 (2.39)   98.02 (0.38)    73.68 (0.84)    59.08 (0.27)    49.56 (0.72)    42.73 (0.17)    37.33 (0.28)
3-Col13   347.97 (0.00)   178.98 (4.07)   123.22 (2.62)   90.09 (1.79)    71.84 (1.48)    61.42 (2.13)    53.32 (0.52)    45.88 (1.12)
3-Col14   420.88 (0.00)   224.85 (3.82)   152.87 (2.07)   113.88 (0.92)   92.51 (1.66)    76.46 (0.57)    66.68 (0.40)    58.39 (0.74)
3-Col15   528.20 (0.00)   278.26 (5.67)   184.32 (1.48)   138.84 (1.63)   111.29 (2.58)   94.25 (0.97)    80.58 (0.24)    71.35 (0.44)
3-Col16   673.94 (0.00)   331.5? (4.19)   224.60 (?.57)   169.90 (7.79)   137.37 (1.34)   111.57 (2.10)   97.50 (0.??)    86.46 (?.6?)
3-Col17   786.43 (0.00)   403.02 (5.80)   263.37 (2.69)   207.01 (5.21)   158.80 (5.81)   135.89 (1.64)   116.68 (3.89)   102.44 (3.54)
3-Col18   965.36 (0.00)   473.35 (4.88)   312.04 (6.90)   238.99 (4.15)   190.61 (1.65)   159.41 (4.64)   140.46 (1.49)   124.70 (1.08)
HP1       3.50 (0.00)     1.82 (0.01)     1.27 (0.00)     1.00 (0.02)     0.80 (0.00)     0.70 (0.00)     0.63 (0.00)     0.57 (0.01)
HP2       13.24 (0.00)    6.84 (0.05)     4.64 (0.01)     3.59 (0.05)     2.90 (0.04)     2.48 (0.01)     2.19 (0.06)     1.97 (0.04)
HP3       29.28 (0.00)    15.63 (0.28)    10.41 (0.26)    7.90 (0.19)     6.30 (0.02)     5.52 (0.22)     4.67 (0.03)     4.18 (0.05)
HP4       51.80 (0.00)    26.55 (0.03)    18.05 (0.02)    13.82 (0.16)    11.14 (0.04)    9.52 (0.14)     8.22 (0.05)     7.33 (0.03)
HP5       80.87 (0.00)    42.60 (0.18)    28.68 (0.38)    21.61 (0.05)    17.56 (0.12)    15.07 (0.13)    13.05 (0.01)    11.56 (0.09)
HP6       212.84 (0.00)   110.90 (0.48)   74.50 (0.56)    56.16 (0.26)    45.62 (0.35)    38.10 (0.42)    33.46 (0.11)    29.57 (0.10)
HP7       274.43 (0.00)   141.00 (0.59)   94.69 (0.84)    72.07 (0.11)    58.30 (0.35)    48.86 (0.15)    42.67 (0.21)    37.14 (0.29)
HP8       160.79 (0.00)   82.52 (0.35)    56.06 (0.23)    42.64 (0.34)    34.09 (0.11)    29.26 (0.05)    25.50 (0.32)    22.45 (0.16)
HP9       343.05 (0.00)   176.40 (0.89)   118.84 (0.47)   89.88 (0.53)    73.53 (0.19)    61.61 (0.43)    53.33 (0.10)    47.68 (0.03)
HP10      117.16 (0.00)   60.84 (0.29)    41.05 (0.49)    31.40 (0.32)    25.55 (0.13)    21.43 (0.17)    18.81 (0.08)    16.77 (0.23)
HP11      422.72 (0.00)   218.38 (0.51)   146.26 (0.38)   110.34 (0.56)   89.77 (0.19)    75.15 (0.25)    66.01 (0.59)    57.66 (1.05)
HP12      510.15 (0.00)   261.06 (2.47)   173.57 (1.93)   132.88 (0.21)   107.51 (0.63)   90.66 (0.80)    78.44 (0.44)    70.26 (0.38)

Efficiency, by number of CPUs

Problem   2      3      4      5      6      7      8
3-Col?    0.98   0.96   0.95   0.95   0.95   0.94   0.94
3-Col13   0.97   0.94   0.97   0.97   0.94   0.93   0.95
3-Col14   0.94   0.92   0.92   0.91   0.92   0.90   0.90
3-Col15   0.95   0.96   0.95   0.95   0.93   0.94   0.93
3-Col16   1.02   1.00   0.99   0.98   1.01   0.99   0.97
3-Col17   0.98   1.00   0.95   0.99   0.96   0.96   0.96
3-Col18   1.02   1.03   1.01   1.01   1.01   0.98   0.97
HP1       0.96   0.92   0.88   0.88   0.83   0.79   0.77
HP2       0.97   0.95   0.92   0.91   0.89   0.86   0.84
HP3       0.94   0.94   0.93   0.93   0.88   0.90   0.88
HP4       0.98   0.96   0.94   0.93   0.91   0.90   0.88
HP5       0.95   0.94   0.94   0.92   0.89   0.89   0.87
HP6       0.96   0.95   0.93   0.92   0.91   0.89   0.87
HP7       0.97   0.96   0.94   0.94   0.92   0.90   0.90
HP8       0.96   0.95   0.95   0.93   0.93   0.91   0.90
HP9       0.97   0.97   0.95   0.94   0.94   0.92   0.92
HP10      0.97   0.96   0.95   0.93   0.93   0.92   0.90
HP11      0.97   0.96   0.96   0.94   0.94   0.91   0.92
HP12      0.98   0.98   0.96   0.95   0.94   0.93   0.91

Table B2. 3Colorability and Hamiltonian Path results: average instantiation times in seconds (standard deviation), efficiency

Appendix C Application of the Single Rule Parallelism

This appendix reports a complete example of the application of the single rule
parallelism for computing in parallel the instantiation of a rule. Consider the following program
P encoding the 3-Colorability problem:

(r) col(X, red) ∨ col(X, yellow) ∨ col(X, green) :- node(X).
(c) :- edge(X, Y), col(X, C), col(Y, C).

Assume that, after the instantiation of rule r, the extensions of predicate node and
predicate col are the ones reported in Table C1, and that the extension of the predicate edge is
the one reported in Table C2.





     Predicate extensions

 1   node(a)
 2   node(b)
 3   node(c)
 4   node(d)

 1   col(a, red)
 2   col(a, yellow)
 3   col(a, green)
 4   col(b, red)
 5   col(b, yellow)
 6   col(b, green)
 7   col(c, red)
 8   col(c, yellow)
 9   col(c, green)
10   col(d, red)
11   col(d, yellow)
12   col(d, green)

Table C1. Extension of the predicates node and col.



Suppose now that the heuristics suggest applying the single rule level of parallelism
for the instantiation of the constraint (c), and suppose that the extension of predicate edge
is split in two. Then, the extension of the predicate edge is partitioned into two subsets,
which appear divided by a horizontal line in Table C2. The instantiation of constraint (c)
is carried out in parallel by two separate processes, say p1 and p2, which will consider as
extension of edge, respectively, the two splits depicted in Table C2. Process p1 produces
the following ground constraints:

:- edge(a, b), col(a, red), col(b, red).
:- edge(a, b), col(a, yellow), col(b, yellow).
:- edge(a, b), col(a, green), col(b, green).
:- edge(b, c), col(b, red), col(c, red).
:- edge(b, c), col(b, yellow), col(c, yellow).
:- edge(b, c), col(b, green), col(c, green).

     Predicate extension

 1   edge(a, b)
 2   edge(b, c)
     ----------
 3   edge(b, d)
 4   edge(c, d)

Table C2. Extension of the predicate edge.



whereas process p2 produces the following ground constraints:

:- edge(b, d), col(b, red), col(d, red).
:- edge(b, d), col(b, yellow), col(d, yellow).
:- edge(b, d), col(b, green), col(d, green).
:- edge(c, d), col(c, red), col(d, red).
:- edge(c, d), col(c, yellow), col(d, yellow).
:- edge(c, d), col(c, green), col(d, green).



Technically speaking, this is obtained by a call to procedure SplitExtension described in
Appendix A. The procedure will create two virtual splits, say V1 and V2, with:



V1.S_begin = edge(a, b)      V1.ΔS_begin = ⊥
V1.S_end   = edge(b, c)      V1.ΔS_end   = ⊥
V2.S_begin = edge(b, d)      V2.ΔS_begin = ⊥
V2.S_end   = edge(c, d)      V2.ΔS_end   = ⊥



where ⊥ indicates a null iterator (usually an iterator that has moved past the
end of a container), which, in this case, is used to represent that no split
containing instances from ΔS is created.
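
The example of this appendix can be replayed with a short Python sketch. It is only an illustration of the idea, not the implementation of the system: the function `ground_constraint` and the use of `ThreadPoolExecutor` are assumptions of this sketch, and the join with col is simplified by relying on the fact that, in the extension given above, every node occurs with all three colors. Python threads are used here just to mirror the structure of the computation, not to measure speedups.

```python
from concurrent.futures import ThreadPoolExecutor

COLORS = ("red", "yellow", "green")


def ground_constraint(edge_split):
    """Ground ':- edge(X, Y), col(X, C), col(Y, C).' over one split of edge."""
    return [
        f":- edge({x}, {y}), col({x}, {c}), col({y}, {c})."
        for (x, y) in edge_split
        for c in COLORS
    ]


edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
splits = [edges[:2], edges[2:]]  # the two splits of the extension of edge

# Each worker instantiates the same constraint on its own split.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_process = list(pool.map(ground_constraint, splits))

parallel_result = [g for part in per_process for g in part]
serial_result = ground_constraint(edges)

for g in parallel_result:
    print(g)
```

Concatenating the outputs of the two workers yields the same twelve ground constraints that a serial instantiation over the whole extension of edge produces.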



